Information Theoretic Measures for Clusterings Comparison: Variants, Properties, Normalization and Correction for Chance

Authors

  • Nguyen Xuan Vinh
  • Julien Epps
  • James Bailey
Abstract

Information theoretic measures form a fundamental class of measures for comparing clusterings, and have recently received increasing interest. Nevertheless, a number of questions concerning their properties and inter-relationships remain unresolved. In this paper, we perform an organized study of information theoretic measures for clustering comparison, including several existing popular measures in the literature, as well as some newly proposed ones. We discuss and prove their important properties, such as the metric property and the normalization property. We then highlight to the clustering community the importance of correcting information theoretic measures for chance, especially when the data size is small compared to the number of clusters present therein. Of the available information theoretic based measures, we advocate the normalized information distance (NID) as a general measure of choice, for it possesses concurrently several important properties, such as being both a metric and a normalized measure, admitting an exact analytical adjusted-for-chance form, and using the nominal [0,1] range better than other normalized variants.
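As a rough illustration of the measure the abstract advocates, the sketch below computes the NID for two hard clusterings given as label lists. It assumes the commonly stated definition NID(U, V) = 1 - I(U, V) / max{H(U), H(V)}; the function names and the convention of returning 0 when both clusterings are a single cluster are illustrative choices, not taken from this page. No correction for chance is applied here.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in nats) of a hard clustering given as a label list."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(u, v):
    """Mutual information (in nats) between two labelings of the same items."""
    n = len(u)
    count_u = Counter(u)
    count_v = Counter(v)
    joint = Counter(zip(u, v))  # contingency table, non-empty cells only
    return sum(
        (nij / n) * math.log(nij * n / (count_u[a] * count_v[b]))
        for (a, b), nij in joint.items()
    )


def nid(u, v):
    """Normalized information distance, assumed to be 1 - I(U,V) / max(H(U), H(V))."""
    if len(u) != len(v):
        raise ValueError("both clusterings must label the same items")
    h_max = max(entropy(u), entropy(v))
    if h_max == 0.0:
        # Assumption: two single-cluster labelings are treated as identical.
        return 0.0
    return 1.0 - mutual_information(u, v) / h_max
```

Under this definition, two clusterings that differ only by a renaming of the cluster labels have distance 0, and two statistically independent clusterings have distance 1.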

Similar resources

Comparing Clusterings by the Variation of Information

This paper proposes an information theoretic criterion for comparing two partitions, or clusterings, of the same data set. The criterion, called variation of information (VI), measures the amount of information lost and gained in changing from clustering C to clustering C′. The criterion makes no assumptions about how the clusterings were generated and applies to both soft and hard clusterings....
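A minimal sketch of the criterion for hard clusterings, assuming the standard form VI(C, C′) = H(C | C′) + H(C′ | C), where the two conditional entropies correspond to the information lost and gained. The function name is illustrative, and soft clusterings are not handled.

```python
import math
from collections import Counter


def variation_of_information(c1, c2):
    """VI (in nats) between two hard clusterings of the same items.

    Computed as H(C1 | C2) + H(C2 | C1) from the joint label counts.
    """
    if len(c1) != len(c2):
        raise ValueError("both clusterings must label the same items")
    n = len(c1)
    count_1 = Counter(c1)
    count_2 = Counter(c2)
    joint = Counter(zip(c1, c2))  # non-empty cells of the contingency table
    vi = 0.0
    for (a, b), nij in joint.items():
        p_ab = nij / n
        # -p(a,b) * [log p(a|b) + log p(b|a)]
        vi -= p_ab * (math.log(nij / count_2[b]) + math.log(nij / count_1[a]))
    return vi
```

The value is 0 when the two clusterings coincide up to relabeling, and it grows as each clustering becomes less predictable from the other.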

Comparing Clusterings – an information based distance

This paper proposes an information theoretic criterion for comparing two partitions, or clusterings, of the same data set. The criterion, called variation of information (VI), measures the amount of information lost and gained in changing from clustering C to clustering C′. The basic properties of VI are presented and discussed. We focus on two kinds of properties: (1) those that help one build...

Adjusting for Chance Clustering Comparison Measures

Adjusted for chance measures are widely used to compare partitions/clusterings of the same data set. In particular, the Adjusted Rand Index (ARI) based on pair-counting, and the Adjusted Mutual Information (AMI) based on Shannon information theory are very popular in the clustering community. Nonetheless it is an open problem as to what are the best application scenarios for each measure and gu...

Comparing hard and overlapping clusterings

Similarity measures for comparing clusterings is an important component, e.g., of evaluating clustering algorithms, for consensus clustering, and for clustering stability assessment. These measures have been studied for over 40 years in the domain of exclusive hard clusterings (exhaustive and mutually exclusive object sets). In the past years, the literature has proposed measures to handle more...

Methods for Comparing Subspace Clusterings

Licentiate's thesis. Subspace clustering methods aim to find groups of similar data points in various subspaces of the original data space. They combine and generalize clustering and feature extraction. Subspace clustering methods are becoming more and more popular, and new algorithms are being published at an increasing rate. These algorithms have been successfully applied for ins...

Journal:
  • Journal of Machine Learning Research

Volume 11  Issue 

Pages  -

Publication date 2010